Spotting Words in Latin, Devanagari and Arabic Scripts
Abstract
A system for spotting words in scanned document images in three scripts, Devanagari, Arabic, and Latin, is described. The three main components of the system are a word segmenter, a shape-based matcher for words, and a search interface. The user gives a query, which can be either a word image or text. Candidate words found in the documents are retrieved and ranked, where the ranking criterion is a similarity score between the query and each candidate word based on global word shape features. Because the features describe the shape of the whole word, the word spotting technique is independent of the script used. The system performs better on printed text than on handwritten text. For handwritten English, a precision of 60% was obtained at a recall of 50%. An alternative approach comprising prototype selection and word matching, which yields better performance for handwritten documents, is also discussed. For printed Sanskrit documents, a precision as high as 90% was obtained at a recall of 50%.
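The retrieve-and-rank step and the "precision at a given recall" figures quoted above can be illustrated with a minimal sketch. It assumes each word image has already been reduced to a fixed-length vector of global shape features, and it uses cosine similarity as the score. Both are assumptions made for illustration: the abstract does not specify the features or the similarity measure, and the names `cosine_similarity`, `rank_candidates`, and `precision_at_recall` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors.

    Assumption: the abstract does not name the similarity measure;
    cosine similarity is used here only as a stand-in.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return candidate word ids sorted by decreasing similarity to the query.

    `candidates` maps a word id to its (hypothetical) shape feature vector.
    """
    scored = [(cosine_similarity(query, feats), word_id)
              for word_id, feats in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word_id for _, word_id in scored]


def precision_at_recall(ranked, relevant, target_recall):
    """Precision at the first rank where recall reaches `target_recall`.

    Returns None if the ranked list never reaches the target recall.
    """
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = 0
    for k, word_id in enumerate(ranked, start=1):
        if word_id in relevant:
            hits += 1
        if hits / len(relevant) >= target_recall:
            return hits / k
    return None


# Toy data: made-up feature vectors, not taken from the paper.
query = [1.0, 0.0, 1.0]
candidates = {
    "w1": [1.0, 0.1, 0.9],
    "w2": [0.0, 1.0, 0.0],
    "w3": [0.9, 0.0, 1.0],
    "w4": [0.5, 0.5, 0.5],
}
ranked = rank_candidates(query, candidates)
precision = precision_at_recall(ranked, relevant={"w1", "w2"}, target_recall=0.5)
```

Because the score depends only on the feature vectors, nothing in this sketch is tied to a particular script, which mirrors the script-independence claim in the abstract.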
Similar resources
Zone-based Keyword Spotting in Bangla and Devanagari Documents
In this paper we present a word spotting system in text lines for offline Indic scripts such as Bangla (Bengali) and Devanagari. Recently, it was shown that a zone-wise recognition method improves word recognition performance over conventional full-word recognition systems in Indic scripts [29]. Inspired by this idea, we consider the zone segmentation approach and use middle zone information ...
Word Searching in Scene Image and Video Frame in Multi-Script Scenario using Dynamic Shape Coding
Retrieval of text information from natural scene images and video frames is a challenging task due to its inherent problems like complex character shapes, low resolution, background noise, etc. Available OCR systems often fail to retrieve such information in scene/video frames. Keyword spotting, an alternative way to retrieve information, performs efficient text searching in such scenarios. How...
Cross-language Framework for Word Recognition and Spotting of Indic Scripts
Handwritten word recognition and spotting for low-resource scripts are difficult because sufficient training data is not available and collecting data for such scripts is often expensive. This paper presents a novel cross-language platform for handwritten word recognition and spotting for such low-resource scripts, where training is performed with a sufficiently large dataset of an available scr...
A Novel Approach for Word Retrieval from Devanagari Document Images
A large amount of information lies dormant in historical documents and manuscripts. This information would be lost if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For indigenous scripts of India, there are very few OCRs that can succe...
Generalization of Hindi OCR Using Adaptive Segmentation and Font Files
In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort and extend work found in [20, 2]. The system includes script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of t...
Publication date: 2006